169 research outputs found

    Auto-encoders: reconstruction versus compression

    We discuss the similarities and differences between training an auto-encoder to minimize the reconstruction error, and training the same auto-encoder to compress the data via a generative model. Minimizing a codelength for the data using an auto-encoder is equivalent to minimizing the reconstruction error plus correcting terms which have an interpretation as either a denoising or a contractive property of the decoding function. These terms are related but not identical to those used in denoising or contractive auto-encoders [Vincent et al. 2010, Rifai et al. 2011]. In particular, the codelength viewpoint fully determines an optimal noise level for the denoising criterion.
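
    Below is a minimal PyTorch sketch contrasting the two training criteria the abstract compares (illustrative code, not the paper's: the network shapes, the MSE loss, and the hand-picked noise level `sigma` are all assumptions, whereas the codelength viewpoint determines the optimal noise level rather than leaving it as a hyperparameter):

        import torch
        import torch.nn as nn

        encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU())
        decoder = nn.Sequential(nn.Linear(128, 784), nn.Sigmoid())
        params = list(encoder.parameters()) + list(decoder.parameters())
        opt = torch.optim.Adam(params, lr=1e-3)

        sigma = 0.1  # hand-picked noise level; the codelength view would
                     # fix this optimally instead

        def training_step(x, denoise=True):
            # denoise=False: plain reconstruction error.
            # denoise=True: corrupt the input first, a stand-in for the
            # correcting (denoising) terms mentioned in the abstract.
            x_in = x + sigma * torch.randn_like(x) if denoise else x
            loss = nn.functional.mse_loss(decoder(encoder(x_in)), x)
            opt.zero_grad(); loss.backward(); opt.step()
            return loss.item()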

    Online Natural Gradient as a Kalman Filter

    We cast Amari's natural gradient in statistical learning as a specific case of Kalman filtering. Namely, applying an extended Kalman filter to estimate a fixed unknown parameter of a probabilistic model from a series of observations is rigorously equivalent to estimating this parameter via an online stochastic natural gradient descent on the log-likelihood of the observations. In the i.i.d. case, this relation is a consequence of the "information filter" phrasing of the extended Kalman filter. In the recurrent (state space, non-i.i.d.) case, we prove that the joint Kalman filter over states and parameters is a natural gradient on top of real-time recurrent learning (RTRL), a classical algorithm to train recurrent models. This exact algebraic correspondence provides relevant interpretations for natural gradient hyperparameters such as learning rates or the initialization and regularization of the Fisher information matrix.
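
    A toy numerical check of the correspondence in the simplest i.i.d. case, estimating the mean theta of a N(theta, 1) model (the setup and names are illustrative, not the paper's notation): the online natural gradient with accumulated Fisher information and the Kalman filter for a static parameter both reduce to the running mean of the observations.

        import numpy as np

        rng = np.random.default_rng(0)
        xs = rng.normal(loc=2.0, scale=1.0, size=1000)

        # Online natural gradient: the Fisher information of N(theta, 1)
        # is 1 per sample, so after t samples the accumulated Fisher is t
        # and the step is theta += (1/t) * (x_t - theta).
        theta_ng, fisher = 0.0, 0.0
        for x in xs:
            fisher += 1.0
            theta_ng += (x - theta_ng) / fisher

        # Kalman filter for the static parameter (no process noise,
        # unit observation noise, near-flat prior):
        theta_kf, P = 0.0, 1e12
        for x in xs:
            K = P / (P + 1.0)            # Kalman gain
            theta_kf += K * (x - theta_kf)
            P = (1.0 - K) * P

        print(theta_ng, theta_kf)        # both ≈ the running mean of xs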

    Unbiasing Truncated Backpropagation Through Time

    Truncated Backpropagation Through Time (truncated BPTT) is a widespread method for learning recurrent computational graphs. Truncated BPTT keeps the computational benefits of Backpropagation Through Time (BPTT) while relieving the need for a complete backtrack through the whole data sequence at every step. However, truncation favors short-term dependencies: the gradient estimate of truncated BPTT is biased, so it does not benefit from the convergence guarantees of stochastic gradient theory. We introduce Anticipated Reweighted Truncated Backpropagation (ARTBP), an algorithm that keeps the computational benefits of truncated BPTT while providing unbiasedness. ARTBP works by using variable truncation lengths together with carefully chosen compensation factors in the backpropagation equation. We check the viability of ARTBP on two tasks. First, on a simple synthetic task where careful balancing of temporal dependencies at different scales is needed, truncated BPTT displays unreliable performance and, in worst-case scenarios, divergence, while ARTBP converges reliably. Second, on Penn Treebank character-level language modelling, ARTBP slightly outperforms truncated BPTT.
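
    The following Python sketch illustrates the compensation idea with a simplified per-step coin flip (an assumption for illustration; the paper's distribution of truncation lengths and its exact compensation factors differ): cutting backpropagation at each step with probability p, and reweighting surviving gradients by 1/(1-p), gives per-step factors of expectation 1 and hence an unbiased estimate.

        import numpy as np

        rng = np.random.default_rng(0)

        def sample_compensations(T, p=0.05):
            """Per-step factors applied to the gradient flowing through
            step t: 0 with probability p (truncate), 1/(1-p) otherwise.
            Each factor has expectation 1, so the product over any span
            of steps is unbiased."""
            cut = rng.random(T) < p
            return np.where(cut, 0.0, 1.0 / (1.0 - p))

        # Empirical check that compensation removes the truncation bias:
        T, trials = 50, 100_000
        prods = [sample_compensations(T).prod() for _ in range(trials)]
        print(np.mean(prods))  # ≈ 1; plain truncation gives (1-p)^T < 1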

    A curved Brunn--Minkowski inequality on the discrete hypercube

    We compare two approaches to Ricci curvature on non-smooth spaces, in the case of the discrete hypercube {0,1}^N. While the coarse Ricci curvature of the first author readily yields a positive value for curvature, the displacement convexity property of Lott, Sturm and the second author could not be fully implemented. Yet along the way we obtain new results of a combinatorial and probabilistic nature, including a curved Brunn--Minkowski inequality on the discrete hypercube.
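
    For orientation, here is the schematic shape of a curved Brunn--Minkowski inequality, written in LaTeX (the constant K/8 and the exact discrete statement are illustrative placeholders, not the paper's theorem):

        % A, B: subsets of the space; M: their set of midpoints;
        % d(A,B): distance between the sets. Constants are schematic.
        \[
          \log \# M \;\ge\; \tfrac{1}{2}\log \# A + \tfrac{1}{2}\log \# B
          \;+\; \tfrac{K}{8}\, d(A,B)^2 ,
        \]
        % with K > 0 playing the role of a lower bound on the Ricci
        % curvature; on the hypercube \{0,1\}^N the relevant curvature
        % scale is of order 1/N.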

    Layer-wise learning of deep generative models

    When using deep, multi-layered architectures to build generative models of data, it is difficult to train all layers at once. We propose a layer-wise training procedure admitting a performance guarantee compared to the global optimum. It is based on an optimistic proxy of future performance, the best latent marginal. We interpret auto-encoders in this setting as generative models by showing that they train a lower bound of this criterion. We test the new learning procedure against a state-of-the-art method (stacked RBMs) and find that it improves performance. Both theory and experiments highlight the importance, when training deep architectures, of using an inference model (from data to hidden variables) richer than the generative model (from hidden variables to data).
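
    A minimal sketch of greedy layer-wise training in PyTorch (illustrative assumptions throughout: plain auto-encoder layers and an MSE reconstruction loss stand in for the paper's best-latent-marginal criterion, and the dimensions are arbitrary):

        import torch
        import torch.nn as nn

        def train_layer(data, in_dim, hid_dim, epochs=100, lr=1e-3):
            enc = nn.Sequential(nn.Linear(in_dim, hid_dim), nn.ReLU())
            dec = nn.Linear(hid_dim, in_dim)
            opt = torch.optim.Adam(
                [*enc.parameters(), *dec.parameters()], lr=lr)
            for _ in range(epochs):
                loss = nn.functional.mse_loss(dec(enc(data)), data)
                opt.zero_grad(); loss.backward(); opt.step()
            return enc

        def layerwise_pretrain(data, dims):
            """dims, e.g. [784, 256, 64]: train one layer at a time,
            each on the codes produced by the layers below it."""
            layers, h = [], data
            for in_dim, hid_dim in zip(dims[:-1], dims[1:]):
                enc = train_layer(h, in_dim, hid_dim)
                with torch.no_grad():
                    h = enc(h)   # codes become the next layer's data
                layers.append(enc)
            return nn.Sequential(*layers)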

    Objective Improvement in Information-Geometric Optimization

    Information-Geometric Optimization (IGO) is a unified framework of stochastic algorithms for optimization problems. Given a family of probability distributions, IGO turns the original optimization problem into a new maximization problem on the parameter space of the probability distributions. IGO updates the parameter of the probability distribution along the natural gradient, taken with respect to the Fisher metric on the parameter manifold, aiming at maximizing an adaptive transform of the objective function. IGO recovers several known algorithms as particular instances: for the family of Bernoulli distributions IGO recovers PBIL, for the family of Gaussian distributions the pure rank-mu CMA-ES update is recovered, and for exponential families in expectation parametrization the cross-entropy/ML method is recovered. This article provides a theoretical justification for the IGO framework by proving that any step size not greater than 1 guarantees monotone improvement over the course of optimization, in terms of q-quantile values of the objective function f. The range of admissible step sizes is independent of f and its domain. We extend the result to cover the case of different step sizes for blocks of the parameters in the IGO algorithm. Moreover, we prove that expected fitness improves over time when fitness-proportional selection is applied, in which case the RPP algorithm is recovered.
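
    As a concrete toy instance of the Bernoulli case (the PBIL-like update), the following Python sketch maximizes a simple objective on {0,1}^n with quantile-based selection weights and a step size delta <= 1; the objective, population size, and weighting scheme are illustrative choices, not the article's setup.

        import numpy as np

        rng = np.random.default_rng(0)

        def f(x):                          # toy objective: count the ones
            return x.sum()

        n, lam, delta = 20, 50, 0.5        # dimension, population, step size
        p = np.full(n, 0.5)                # Bernoulli parameters

        for _ in range(100):
            X = (rng.random((lam, n)) < p).astype(float)  # sample population
            fitness = np.array([f(x) for x in X])
            ranks = np.argsort(np.argsort(-fitness))      # rank 0 = best
            w = np.where(ranks < lam // 4, 1.0, 0.0)      # top quantile
            w /= w.sum()                                  # weights sum to 1
            # Natural-gradient step in expectation parametrization,
            # p <- p + delta * sum_i w_i (x_i - p), with delta <= 1:
            p = (1.0 - delta) * p + delta * (w[:, None] * X).sum(axis=0)
            p = p.clip(0.05, 0.95)         # stay off the boundary

        print(p.round(2))   # parameters drift toward 1 on this toy f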